Strong Scaling Study with PackageCompiler

Adam Lyon, Muon g-2 IRMA Analysis, Fermilab @ Home, November 2020

This notebook examines strong scaling properties of Julia IRMA jobs making just one plot. See the strongScaling.jl notebook for background. For this notebook, PackageCompiler.jl was used to compile the functions needed by the job, which significantly improved Julia's startup time.

This notebook answers issue #18 (Analyze results from Strong Scaling jobs); the code may be found in PR #20. This file is IRMA/analyses/018_StrongScaling/StrongScaling_pkgc.jl.

What is this notebook?

This is a Pluto.jl notebook and the code here is written in Julia. It is like a Jupyter notebook, but with important differences. The most important is that the results appear above the code. Another is that Pluto.jl notebooks are reactive: the notebook keeps track of cell-to-cell dependencies, and when a cell changes, its dependent cells update at the same time, so the notebook is always in a consistent state. This means that even while you are looking at a static HTML representation of the notebook, you can be assured that it is consistent and up-to-date. You'll see that some results have a little triangle next to them; clicking on it opens an expanded view of the results.

The main results are replicated at the top, with discussion, in the Results section. The plots are stored in variables whose names you can see below each plot. The Code section contains all the code for this notebook, so you can see how each plot was made.

Introduction

This notebook examines strong scaling properties of my Julia IRMA jobs that make one plot of energy of clusters. Nearly all of the runs used the maximum 32 physical cores on each node (tasks). I ran with 2, 3, 4, 5, 6, 8, 10, 12, 15, and 20 nodes. To look for conflicts between tasks on nodes, I ran 5 nodes with 2, 4, 8, 16, and 32 tasks.

On 11/2, I ran IRMA/jobs/003_StrongScaling/strongScalingJob.jl that was compiled by PackageCompiler.jl from commit 823ce57 of master (PR #22). See this comment in issue #3.

I ran over era 2D data, which amounted to approximately 16B rows. The total HDF5 file size is 250 GB.
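
As a rough sketch of the per-rank read pattern with HDF5.jl (the file and dataset names here are hypothetical, not the actual IRMA layout), each rank opens the dataset and reads only its slice:

```julia
using HDF5

# Hypothetical file/dataset names for illustration only.
h5open("era2D.h5", "r") do f
    ds = f["clusters/energy"]       # dataset is opened, but no data is read yet
    chunk = ds[1_000_001:2_000_000] # indexing reads just this range from disk
    # ... histogram `chunk` here ...
end
```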

PackageCompiler.jl (see documentation) allows for precompilation of the functions that are actually used by the Julia script. A shared object (.so) file is loaded at Julia start time. Using PackageCompiler.jl significantly reduces the time it takes to load packages, and functions do not need to be JIT compiled, speeding up the code. In an MPI environment, using PackageCompiler.jl is ideal because it avoids each rank doing its own JIT compilation. The downside of PackageCompiler.jl is that the compiled code in the shared object file is fixed, so if you update packages to new versions, you may have a mismatch. But given that the environment is pretty carefully controlled, this isn't much of a problem.
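
Building such a sysimage is a one-time step along these lines (the package list and file names are illustrative, not the exact ones used for the job):

```julia
using PackageCompiler

# Compile the listed packages, plus the methods exercised by the job script,
# into a shared object. Running the script during the build records which
# methods to precompile.
create_sysimage([:MPI, :HDF5];
                sysimage_path = "irma_sysimage.so",
                precompile_execution_file = "strongScalingJob.jl")
```

Julia is then started with `julia -J irma_sysimage.so ...` so every MPI rank loads the precompiled code instead of JIT-compiling it.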

This notebook is an update to a previous analysis Pluto notebook at IRMA/analyses/018_StrongScaling/StrongScaling.jl. That analysis did not use PackageCompiler.jl, and significant time appeared to be spent loading packages and compiling. This notebook, which uses PackageCompiler.jl, shows significant stability and improvement in the timings.

There are occasional anomalies. One 10-node job took an extremely long time; running it again produced more normal-looking timings.

Results

There are several types of results.

Histogram comparison

This was done in the previous notebook without PackageCompiler.jl. I did not re-run this analysis.

MPI Timing information

I record the time in the job with an IRMA Stopwatch. The stopwatch uses MPI.Wtime under the hood. The times are recorded as follows.

Label               Meaning
start               After packages are loaded, functions are defined, and MPI.Init is called
openedFile          After the h5open statement
openedDataSet       After the energy dataset is opened (but no data read yet)
determineRanges     After the ranges to examine are determined with partitionDS
readDataSet         After the dataset is read (this reads the actual data for the rank)
makeHistogram       After the data is histogrammed
gatheredHistograms  After all the histograms have been gathered to the root rank
reducedHistograms   After the histograms have been reduced to the root rank
gatherRankLogs      After the rank's log info has been gathered to the root rank

Timing of anything before start is not recorded.
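
The IRMA Stopwatch itself is not shown here, but the underlying pattern is simply recording MPI.Wtime at each labeled point; a minimal sketch:

```julia
using MPI
MPI.Init()

# Record (label, wall-clock time) pairs at each stage boundary.
marks = Tuple{String,Float64}[]
mark!(label) = push!(marks, (label, MPI.Wtime()))

mark!("start")
# ... open the file, read the data, make the histogram ...
mark!("readDataSet")

# The per-stage time is the difference between consecutive marks.
for i in 2:lastindex(marks)
    println(marks[i][1], ": ", marks[i][2] - marks[i-1][2], " s")
end
```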

Let's look at the timing plots. Note that you can see all the timing plots in the Code section. Some representative plots will be reproduced here.

Here is a run using three nodes (and 32 ranks per node).

Each rank reads 175,375,305 rows (or one more). These timings are slightly better than without PackageCompiler.jl (package loading is not shown here).
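
That "or one more" behavior is what an even split of rows over ranks produces; here is a minimal sketch of such a split (this is not the actual partitionDS implementation, just an illustration):

```julia
# Split nrows over nranks; the first rem(nrows, nranks) ranks get one extra row.
function rankrange(nrows::Integer, nranks::Integer, rank::Integer)  # rank is 0-based
    base, extra = divrem(nrows, nranks)
    lo = rank * base + min(rank, extra) + 1
    hi = lo + base - 1 + (rank < extra ? 1 : 0)
    return lo:hi
end

rankrange(16_000_000_000, 96, 0)  # first rank's row range for 3 nodes × 32 ranks
```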

And here's ten nodes...

Again, we see that some nodes are faster than others for reading.

Here's a box and whisker plot of total MPI time...

Let's just look at the reading time and see if that scales appropriately.

There's a bit of variation. Look at the plots in the Code section to see why.

Let's now concentrate on seeing if the reading time scales. We'll take the mean read time for each run...

The expected time is anchored at the 2-node run, so a 4-node run should be twice as fast if the scaling were perfect. What we see is pretty close.
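
Concretely, the expected column is perfect strong scaling anchored at the 2-node mean read time (the 34.7144 s value comes from the table below):

```julia
# Perfect strong scaling anchored at the 2-node mean read time:
expectedFromMean(nnodes, mean2) = mean2 * 2 / nnodes

expectedFromMean(4, 34.7144)  # ≈ 17.3572 s, half the 2-node time
```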

10 rows × 4 columns

numNodes  readDataSet_maximum  readDataSet_mean  expectedFromMean
       2              36.7791           34.7144           34.7144
       3              25.3889           23.6609           23.1429
       4              18.845            17.7989           17.3572
       5              15.5825           14.2538           13.8858
       6              12.8665           11.9569           11.5715
       8               9.9717            9.24282           8.6786
      10               9.3017            7.8274            6.94288
      12               8.17051           6.70655           5.78573
      15               6.33904           5.63782           4.62859
      20               6.51274           5.26359           3.47144

We can also look at reducing the number of ranks per node to see if there's any significant i/o overhead and/or contention. Here is the read time for 5 nodes but varying the number of ranks per node.

Let's calculate the fraction of time in overhead/contention; that is, the difference between the measured read time and the expected time, divided by the measured read time.
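
In code, using the 5-node, 32-tasks-per-node numbers from this notebook:

```julia
measured = 14.2538            # readDataSet_mean at 5 nodes × 32 tasks per node
expected = 161.093 * 2 / 32   # perfect scaling anchored at the 2-tasks-per-node run
diff     = measured - expected
diffPerc = diff / measured * 100   # ≈ 29.4% of the read time is overhead
```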

So the above plot suggests that fewer ranks per node is more efficient. When we ask for 32 tasks per node, we are possibly wasting about 30% of the read time in i/o overhead and/or contention. Here's a table...

5 rows × 6 columns

numTasks  readDataSet_maximum  readDataSet_mean  expectedFromMean      diff  diffPerc
       2             161.71            161.093           161.093    0.0        0.0
       4              83.0948           82.5291           80.5466   1.98251    2.4022
       8              46.1693           45.4499           40.2733   5.17663   11.3897
      16              27.7527           25.9477           20.1367   5.81102   22.3951
      32              15.5825           14.2538           10.0683   4.18546   29.3638

Remember, there are always 5 nodes in the job.

Accounting Information

We can extract accounting information from the SLURM batch system to see how long the jobs took.

The total batch time is the total wall clock time for the job. The total Julia time is how much time the srun julia ... command took within the batch script. The MPI Julia time is the amount of time recorded within MPI. Note that the latter includes neither the time to start Julia and load packages nor the time to write the results to disk.
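
One way to pull this accounting information into Julia is to shell out to sacct (these are standard sacct options and a job ID from the runs in this notebook; the exact flags used here are not shown):

```julia
# Query SLURM accounting for one job; --parsable2 gives '|'-delimited rows.
cmd = `sacct -j 35817597 --parsable2 --format=JobID,JobName,State,Elapsed,NNodes,NCPUS`
for line in eachline(cmd)
    println(split(line, '|'))
end
```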

The difference between the total batch and Julia times is likely due to setting up the requested number of nodes. This time appears to be significant when requesting a large number of nodes. The shape of the total Julia time curve is puzzling, though it appears to be most efficient around six or seven nodes. It is not clear why it increases and then appears to turn over before twenty nodes.

Conclusions

These results are improved and more stable by using PackageCompiler.jl. Things I conclude from this study:

  • Use PackageCompiler.jl

  • 32 tasks per node seems to involve significant i/o overhead and/or contention

  • Configuring a large number of nodes takes a significant amount of time.

Code

"/Users//lyon/Development/gm2/data/003_StrongScaling_pkgc/"
histoFiles32

Timing information

extractNNodesFromFileName (generic function with 1 method)
dataFrameFromRankData (generic function with 1 method)
theNumNodes
plotsForRun (generic function with 1 method)

Plots for run with 15 nodes (32 ranks per node)

strongScalingPlot
totalMPITimes

10 rows × 2 columns

numNodes  totalTime_maximum
       2            43.576
       3            30.7742
       4            23.5602
       5            19.8548
       6            16.8802
       8            13.6648
      10            12.8117
      12            11.6279
      15             9.69708
      20             9.58383
maxMPITimesPlot

Let's look at the read time scaling and determine the average read time.

dataSetReadTimePlot

10 rows × 4 columns

numNodes  readDataSet_maximum  readDataSet_mean  expectedFromMean
       2              36.7791           34.7144           34.7144
       3              25.3889           23.6609           23.1429
       4              18.845            17.7989           17.3572
       5              15.5825           14.2538           13.8858
       6              12.8665           11.9569           11.5715
       8               9.9717            9.24282           8.6786
      10               9.3017            7.8274            6.94288
      12               8.17051           6.70655           5.78573
      15               6.33904           5.63782           4.62859
      20               6.51274           5.26359           3.47144
readTimePlot

So there does seem to be a little bit of contention for a large number of nodes.


Examine fewer tasks per node

extractNTasksFromFileName (generic function with 1 method)
strongScalingPlot5
total5MPITimes

5 rows × 2 columns

numTasks  totalTime_maximum
       2            185.3
       4             96.1674
       8             54.2057
      16             46.0226
      32             19.8548

5 rows × 3 columns

numTasks  totalTime_maximum  expected
       2            185.3      185.3
       4             96.1674    92.65
       8             54.2057    46.325
      16             46.0226    23.1625
      32             19.8548    11.5812
max5MPITimesPlot

Let's look at the read times...


5 rows × 6 columns

numTasks  readDataSet_maximum  readDataSet_mean  expectedFromMean      diff  diffPerc
       2             161.71            161.093           161.093    0.0        0.0
       4              83.0948           82.5291           80.5466   1.98251    2.4022
       8              46.1693           45.4499           40.2733   5.17663   11.3897
      16              27.7527           25.9477           20.1367   5.81102   22.3951
      32              15.5825           14.2538           10.0683   4.18546   29.3638
max5ReadTimesPlot

This actually looks nice - very efficient.

Trying it as a log plot...

max5ReadTimesLogPlot

Plot the difference...

max5ReadTimesDiffPlot

Examine accounting information

sacct

60 rows × 25 columns (omitted printing of 19 columns)

JobID            JobName                           QOS        State      ExitCode  NNodes
35817597         run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00       2
35817597.batch   batch                             missing    COMPLETED  00:00:00       1
35817597.extern  extern                            missing    COMPLETED  00:00:00       2
35817597.0       julia                             missing    COMPLETED  00:00:00       2
35817702         run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00       3
35817702.batch   batch                             missing    COMPLETED  00:00:00       1
35817702.extern  extern                            missing    COMPLETED  00:00:00       3
35817702.0       julia                             missing    COMPLETED  00:00:00       3
35817713         run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00       4
35817713.batch   batch                             missing    COMPLETED  00:00:00       1
35817713.extern  extern                            missing    COMPLETED  00:00:00       4
35817713.0       julia                             missing    COMPLETED  00:00:00       4
35817716         run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00       5
35817716.batch   batch                             missing    COMPLETED  00:00:00       1
35817716.extern  extern                            missing    COMPLETED  00:00:00       5
35817716.0       julia                             missing    COMPLETED  00:00:00       5
35817738         run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00       6
35817738.batch   batch                             missing    COMPLETED  00:00:00       1
⋮ (first 18 of 60 rows shown)
slurmLogFiles
jobIdFromSlurmLogName (generic function with 1 method)
slurmIds
selectDesiredJobIds (generic function with 1 method)
sacctM

28 rows × 25 columns (omitted printing of 18 columns)

JobID       JobName                           QOS        State      ExitCode  NNodes  NCPUS
35817597    run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00       2    128
35817597.0  julia                             missing    COMPLETED  00:00:00       2     64
35817702    run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00       3    192
35817702.0  julia                             missing    COMPLETED  00:00:00       3     96
35817713    run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00       4    256
35817713.0  julia                             missing    COMPLETED  00:00:00       4    128
35817716    run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00       5    320
35817716.0  julia                             missing    COMPLETED  00:00:00       5    160
35817738    run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00       6    384
35817738.0  julia                             missing    COMPLETED  00:00:00       6    192
35817773    run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00       8    512
35817773.0  julia                             missing    COMPLETED  00:00:00       8    256
35817861    run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00      12    768
35817861.0  julia                             missing    COMPLETED  00:00:00      12    384
35818006    run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00      15    960
35818006.0  julia                             missing    COMPLETED  00:00:00      15    480
35818160    run_strongScalingJob_compiled.sh  debug_hsw  COMPLETED  00:00:00      20   1280
35818160.0  julia                             missing    COMPLETED  00:00:00      20    640
⋮ (first 18 of 28 rows shown)
splitIntoJuliaAndTotal (generic function with 1 method)
runTimesPlot